For our Final Tutorial, I, Annie Donahue, will be working together with Shay Prinz. We are planning to work with 911 call data from 2019 and 2020, possibly looking back to earlier years if it seems necessary or if we have the time. We would like to manipulate this data to see how different events and phenomena affect either people's ability to call 911 or their willingness to do so.
For example, we would like to see what the data can tell us about pre-COVID-19 911 calls versus post-COVID-19 calls by comparing the data from February 2020 with May 2020, then using February and May 2019 as a control to see whether anything changed. We were also thinking about looking at 911 call data during times of civil unrest when there was prominent anti-police sentiment in the U.S. Right now, for example, it would be interesting to see whether fewer people are calling the police as protests calling for police abolition continue across the country. This data can tell us a lot about the current state of New Orleans as a whole, since we will be able to see both crime-reporting data and 911 calls related to health issues. These datasets encompass a lot of information, and it may be difficult to filter them into a form digestible enough to see the true effects of the events we want to center our analysis around.
A specific challenge Shay came across when transforming the data was converting strings to datetimes. Because the dataset is so large, the conversion was taking over 10 minutes. This is because pandas falls back to dateutil.parser.parse when the strings have a non-default format or no format string is supplied. She was able to speed up the process a bit by adding the parameter infer_datetime_format=True.
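An even bigger speed-up usually comes from passing an explicit format string to pd.to_datetime, which skips the per-string guessing entirely. A minimal sketch on made-up timestamps; the format below is an assumption for illustration, not necessarily the dataset's actual one:

```python
import pandas as pd

# A made-up sample in one common US timestamp style; the real dataset's
# format may differ, so treat this format string as an assumption.
raw = pd.Series(["01/15/2019 08:30:00 PM", "02/20/2019 11:05:00 AM"])

# An explicit format skips dateutil's per-string guessing entirely,
# which is usually even faster than infer_datetime_format=True.
parsed = pd.to_datetime(raw, format="%m/%d/%Y %I:%M:%S %p")
print(parsed.dt.year.tolist())
```

To use this on the real data, the format string would first need to be checked against a few raw values from the CSV.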
We are planning on meeting once a week as a starting point, to gauge how much we can get done in one sitting and make sure we are on a good pace to complete our tutorial within the semester. If need be, we will move up to meeting multiple times a week, especially as we get closer to the end of the semester. Mostly, we will be completing our work on GitHub, then using the weekly meeting to set a timeline for the next week, set goals and checkpoints to accomplish before the next time we meet, and check in on each other's progress to make sure our code and our analyses are compatible.
We would like to plan at least one big goal to accomplish each week, which is a hard deadline for us both, as well as a stretch goal that we would ideally finish before the next week but could push to the following week if it became clear we did not have enough time. This keeps us on a steady track while also working with both of our busy schedules: in some weeks we may have more time and get a larger bulk of work done, but even in our busier weeks we would have to get the set goal done, keeping us moving steadily toward completing the project. Ideally, our weeks will revolve around Thursdays, since we are in the service-learning portion of the class and would like to use those times to check in with our service-learning partner and our class with updates.
This dataset reflects incidents that have been reported to the New Orleans Police Department in 2019. Data is provided by Orleans Parish Communication District (OPCD), the administrative office of 9-1-1 for the City of New Orleans.
import pandas as pd
#read the whole file in one pass (low_memory=False) so pandas doesn't infer mixed dtypes chunk by chunk (e.g. in the 'Beat' column)
Calls2019_df = pd.read_csv("../Calls_for_Service_2019.csv", low_memory=False)
#Tidy data by dropping columns that won't be used (NOPD_Item, MapX, MapY, TimeClosed, Beat, Location)
Calls2019_df = Calls2019_df.drop(columns=['NOPD_Item', 'MapX', 'MapY', 'TimeClosed', 'Beat'])
Calls2019_df[['Latitude','Longitude']] = Calls2019_df.Location.str.split(",",expand=True)
#Remove location column
del Calls2019_df['Location']
#Strip parentheses from left and right side
Calls2019_df['Latitude'] = Calls2019_df['Latitude'].replace(to_replace=r'\(', value="", regex=True)
Calls2019_df['Longitude'] = Calls2019_df['Longitude'].replace(to_replace=r'\)', value="", regex=True)
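The parenthesis stripping above can also be done without regex escapes by chaining str.strip and str.split; a small sketch on invented "(lat, lon)" values:

```python
import pandas as pd

# Invented values in the 2019 file's "(lat, lon)" style.
loc = pd.Series(["(29.95, -90.07)", "(30.01, -89.99)"])

# Strip the surrounding parentheses, split on ", ", and cast to float
# so the coordinates are immediately usable for mapping.
coords = loc.str.strip("()").str.split(", ", expand=True).astype(float)
coords.columns = ["Latitude", "Longitude"]
print(coords["Latitude"].tolist())
```

The cast to float also matters later: folium expects numeric coordinates, not strings.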
#Calls2019_df.dtypes
#Convert TimeCreate, TimeDispatch, and TimeArrival to datetime
Calls2019_df['TimeCreate'] = pd.to_datetime(Calls2019_df['TimeCreate'],infer_datetime_format=True)
Calls2019_df['TimeDispatch'] = pd.to_datetime(Calls2019_df['TimeDispatch'],infer_datetime_format=True)
Calls2019_df['TimeArrival'] = pd.to_datetime(Calls2019_df['TimeArrival'],infer_datetime_format=True)
#Remove all rows with years that are not 2019
Calls2019_df = Calls2019_df[(Calls2019_df['TimeCreate'].dt.year == 2019)]
#replace missing data
Calls2019_df
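For the "#replace missing data" step above, two hedged options are to drop or to flag rows whose dispatch time is missing; a sketch on a toy frame whose column names mirror the real dataset but whose values are invented:

```python
import pandas as pd

# Toy frame; the values are invented, only the column names are real.
df = pd.DataFrame({
    "TimeCreate":   pd.to_datetime(["2019-01-01 10:00", "2019-01-02 12:00"]),
    "TimeDispatch": pd.to_datetime(["2019-01-01 10:05", None]),
})

# Option A: drop rows with no dispatch time (useful for response-time math).
dispatched = df.dropna(subset=["TimeDispatch"])

# Option B: keep every call but flag whether it was dispatched.
df["Dispatched"] = df["TimeDispatch"].notna()
print(len(dispatched), df["Dispatched"].tolist())
```

Which option fits depends on the question: dropping biases counts, while flagging keeps every call visible.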
This dataset reflects incidents that have been reported to the New Orleans Police Department in 2020. Data is provided by Orleans Parish Communication District (OPCD), the administrative office of 9-1-1 for the City of New Orleans.
Calls2020_df = pd.read_csv("../Call_for_Service_2020.csv")
#Tidy data by dropping columns that won't be used (NOPD_Item, MapX, MapY, TimeClosed, Beat, Location)
Calls2020_df = Calls2020_df.drop(columns=['NOPD_Item', 'MapX', 'MapY', 'TimeClosed', 'Beat'])
#Clean up location column to get Latitude and Longitude columns
Calls2020_df['Location'] = Calls2020_df['Location'].replace(to_replace='POINT ', value="", regex=True)
Calls2020_df[['Longitude','Latitude']] = Calls2020_df.Location.str.split(" ",expand=True)
#Remove location column
del Calls2020_df['Location']
#Strip parentheses from left and right side
Calls2020_df['Longitude'] = Calls2020_df['Longitude'].replace(to_replace=r'\(', value="", regex=True)
Calls2020_df['Latitude'] = Calls2020_df['Latitude'].replace(to_replace=r'\)', value="", regex=True)
#Calls2020_df.dtypes
#Convert TimeCreate, TimeDispatch, and TimeArrive to datetime
Calls2020_df['TimeCreate'] = pd.to_datetime(Calls2020_df['TimeCreate'],infer_datetime_format=True)
Calls2020_df['TimeDispatch'] = pd.to_datetime(Calls2020_df['TimeDispatch'],infer_datetime_format=True)
Calls2020_df['TimeArrive'] = pd.to_datetime(Calls2020_df['TimeArrive'],infer_datetime_format=True)
#replace missing data
Calls2020_df
Calls2019_df = Calls2019_df.copy()
Calls2019_df['Year'] = Calls2019_df['TimeCreate'].dt.year
Calls2019_df['Month'] = Calls2019_df['TimeCreate'].dt.month
CallsbyMonth2019 = Calls2019_df[Calls2019_df['Month'] < 9]
Calls2020_df = Calls2020_df.copy()
Calls2020_df['Year'] = Calls2020_df['TimeCreate'].dt.year
Calls2020_df['Month'] = Calls2020_df['TimeCreate'].dt.month
CallsbyMonth2020 = Calls2020_df[Calls2020_df['Month'] < 9]
CallsbyMonth = pd.concat([CallsbyMonth2019, CallsbyMonth2020])
CallsbyMonth = CallsbyMonth.reset_index(drop=True)
CallsbyMonth["Month"] = CallsbyMonth["Month"].map({
1: "Jan",
2: "Feb",
3: "March",
4: "Apr",
5: "May",
6: "June",
7: "Jul",
8: "Aug"
})
#result
import matplotlib
import matplotlib.pyplot as plt
import numpy as np
result = CallsbyMonth.pivot_table(index='Month',
columns='Year', values='Type', aggfunc='count').apply(lambda x:100*x / float(x.sum())).round(2)
monthOrder = ['Jan', 'Feb', 'March', 'Apr', 'May', 'June', 'Jul', 'Aug']
ax = result.loc[monthOrder].plot(kind="bar", legend=True,
title ="Percent of Calls by Month", figsize=(20, 10), fontsize=16)
plt.xlabel("Month")
plt.ylabel("Percent")
plt.show()
This graph shows the percent of calls made each month relative to the total calls made from January to August, in both 2019 and 2020. As you can see, the highest proportion of 2020 calls was made in January and February. This was before quarantine, so we cannot say that quarantine and an increased number of calls are correlated, which is what we had predicted. For 2019, the highest proportion of calls was made in May.
!pip install folium
import folium
#cast to float: Latitude/Longitude are still strings after str.split, and folium needs numbers
locations = CallsbyMonth2019[['Latitude', 'Longitude']].dropna().astype(float)
locationlist = locations.values.tolist()
len(locationlist)
from folium.plugins import FastMarkerCluster
NOLA_location = [29.951065, -90.071533 ]
m = folium.Map(location=NOLA_location,
zoom_start=15,
tiles='openstreetmap')
m.add_child(FastMarkerCluster(locationlist, name='2019'))